· Calculate means, SDs, and confidence intervals (89%, 95%, 99%) for one continuous and one binary variable. Interpret the confidence intervals.
· Do two simulations (one continuous, one binary) to show that simulation-based standard deviations of the estimate converge to the formula-based standard error of the sampling distribution. Explain the result to show you understand what you did.
The continuous variable I chose is life expectancy (lifeExp) from the Gapminder dataset.
Mean of lifeExp = 59.47
Standard Deviation of lifeExp = 12.92
Standard Error = \(\frac{SD}{\sqrt{n}}\) = 0.3129845
89% confidence interval = (58.97 , 59.97)
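Interpretation: if we drew many samples in the same way and built an 89% interval from each, about 89% of those intervals would contain the true mean life expectancy. The interval 58.97 to 59.97 years is one such interval.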
95% confidence interval = (58.86 , 60.09)
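Interpretation: about 95% of intervals built this way from repeated samples would contain the true mean life expectancy. The interval 58.86 to 60.09 years is one such interval, and it is wider than the 89% interval because it demands more confidence.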
99% confidence interval = (58.67 , 60.27)
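Interpretation: about 99% of intervals built this way from repeated samples would contain the true mean life expectancy. The interval 58.67 to 60.27 years is one such interval, and it is the widest of the three.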
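The interval arithmetic above can be reproduced with a short script. The sketch below is a minimal Python illustration that works from the rounded summary statistics in the text rather than from the raw data. `n_obs = 1704` is an assumption (the row count of the full gapminder table, which the text does not state), and `normal_ci` is a hypothetical helper name. It uses normal quantiles, so its bounds may differ in the last digit from the reported ones.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the text (rounded).
mean_life_exp = 59.47
sd_life_exp = 12.92
n_obs = 1704  # assumption: number of rows in the full gapminder table


def normal_ci(mean, se, level):
    """Two-sided normal-approximation interval: mean +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se


se = sd_life_exp / sqrt(n_obs)
print(f"SE = {se:.4f}")
for level in (0.89, 0.95, 0.99):
    low, high = normal_ci(mean_life_exp, se, level)
    print(f"{level:.0%} CI = ({low:.2f}, {high:.2f})")
```

The multiplier `z` is the normal quantile at `0.5 + level / 2`, which is roughly 1.598, 1.960, and 2.576 for the three levels.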
We randomly select 150 units from the full dataset and compute their mean, then repeat this selection 80, 580, and 1080 times. This gives 80, 580, and 1080 samples of 150 units each; taking the mean of every sample yields a distribution of sample means for each run.
With 80 repetitions:
The mean of the sample means is 59.55
The standard deviation of the sample means is 0.96
SE of the mean of the sample means = \(\frac{0.96}{\sqrt{80}}\) = 0.107
With 580 repetitions:
The mean of the sample means is 59.44
The standard deviation of the sample means is 1.04
SE of the mean of the sample means = \(\frac{1.04}{\sqrt{580}}\) = 0.043
With 1080 repetitions:
The mean of the sample means is 59.44
The standard deviation of the sample means is 1.00
SE of the mean of the sample means = \(\frac{1}{\sqrt{1080}}\) = 0.030
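The sample size is fixed at 150 in every run; only the number of repetitions changes. The standard deviation of the sample means stays close to 1 (0.96, 1.04, 1.00), which is near the formula-based standard error for samples of 150, \(\frac{12.92}{\sqrt{150}}\) ≈ 1.05.
Dividing that standard deviation by the square root of the number of repetitions gives the uncertainty of the mean of the sample means, and this is the quantity that shrinks: SE80times > SE580times > SE1080times.
This demonstrates that the simulation-based standard deviation of the estimate is close to the formula-based standard error, and that with more repetitions the mean of the sample means settles closer to the true population mean of 59.47.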
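The comparison the exercise asks for, simulated standard deviation of the sample means against \(\frac{SD}{\sqrt{n}}\) with n = 150, can be sketched as follows. This is a minimal illustration, not the code that produced the numbers above. The population here is a hypothetical stand-in (1704 normal draws with the quoted mean and SD), not the real lifeExp column. The function name `sd_of_sample_means` and the seed are also assumptions. Samples are drawn without replacement, so a finite population correction is shown next to the plain formula.

```python
import random
from math import sqrt
from statistics import fmean, pstdev, stdev

random.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in population, NOT the real gapminder lifeExp values.
population = [random.gauss(59.47, 12.92) for _ in range(1704)]


def sd_of_sample_means(pop, sample_size, repetitions):
    """Draw `repetitions` samples of `sample_size`; return the SD of their means."""
    means = [fmean(random.sample(pop, sample_size)) for _ in range(repetitions)]
    return stdev(means)


sample_size = 150
formula_se = pstdev(population) / sqrt(sample_size)
# random.sample draws without replacement, so the spread of the sample means
# is reduced by the finite population correction factor.
fpc = sqrt((len(population) - sample_size) / (len(population) - 1))

print(f"formula SE = {formula_se:.3f}, with correction = {formula_se * fpc:.3f}")
for repetitions in (80, 580, 1080):
    sim_sd = sd_of_sample_means(population, sample_size, repetitions)
    print(f"{repetitions:>5} repetitions: SD of sample means = {sim_sd:.3f}")
```

More repetitions do not make the simulated SD smaller. They make it a steadier estimate of the same target, the standard error for samples of 150.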
Suppose we have a population of 10000 people, of whom 8000 are aged 18 or older and the remaining 2000 are minors. We code adults as 1 and minors as 0.
Then we create a Bernoulli distribution with p = 0.8.
Mean = 0.8
For the variable “Adults or Minors”, Var[X] = p(1-p) = 0.8 * 0.2 = 0.16
SD[X] = \(\sqrt{p(1-p)}\) = \(\sqrt{0.16}\) = 0.4
SE = \(\frac{SD}{\sqrt{n}}\) = \(\frac{0.4}{\sqrt{10000}}\) = 0.4/100 = 0.004
89% confidence interval = mean +/- 1.598 * SE = 0.8 +/- 0.006 = (0.794, 0.806)
Interpretation: if we drew many samples of this size and built an 89% interval from each, about 89% of those intervals would contain the true proportion of adults. The interval 0.794 to 0.806 is one such interval.
95% confidence interval = mean +/- 1.96 * SE = 0.8 +/- 0.008 = (0.792, 0.808)
Interpretation: about 95% of intervals built this way would contain the true proportion of adults. The interval 0.792 to 0.808 is one such interval.
99% confidence interval = mean +/- 2.58 * SE = 0.8 +/- 0.010 = (0.790, 0.810)
Interpretation: about 99% of intervals built this way would contain the true proportion of adults. The interval 0.790 to 0.810 is one such interval.
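As a check on the Bernoulli formulas used above, the sketch below builds the 0/1 population described in the text and compares \(\sqrt{p(1-p)}\) with the standard deviation computed directly from the coded values. It is a minimal illustration under the text's own setup (8000 ones, 2000 zeros, n = 10000 in the SE formula). The variable names are hypothetical, and the multipliers come from normal quantiles.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev

# Population coded as in the text: 1 = adult, 0 = minor.
population = [1] * 8000 + [0] * 2000

p = fmean(population)              # proportion of adults
sd_formula = sqrt(p * (1 - p))     # Bernoulli SD from the formula
sd_direct = pstdev(population)     # population SD computed from the 0/1 values
se = sd_formula / sqrt(len(population))

print(f"p = {p}, SD (formula) = {sd_formula:.4f}, SD (direct) = {sd_direct:.4f}")
print(f"SE = {se:.4f}")

for level in (0.89, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal multiplier
    print(f"{level:.0%}: z = {z:.3f}, interval = ({p - z * se:.3f}, {p + z * se:.3f})")
```

The two SD values agree because the variance of a 0/1 variable is p(1-p).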
We randomly select 500 people from the population and compute the proportion of adults p in that sample, then repeat the draw 20, 200, and 2000 times.
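With 20 repetitions:
The mean of the sample proportions is 0.8039.
The standard deviation of the sample proportions is 0.0140447
SE of the mean of the sample proportions = \(\frac{0.0140}{\sqrt{20}}\) = 0.0031
With 200 repetitions:
The mean of the sample proportions is 0.80064.
The standard deviation of the sample proportions is 0.0169525
SE of the mean of the sample proportions = \(\frac{0.0170}{\sqrt{200}}\) = 0.0012
With 2000 repetitions:
The mean of the sample proportions is 0.800022.
The standard deviation of the sample proportions is 0.0175954
SE of the mean of the sample proportions = \(\frac{0.0176}{\sqrt{2000}}\) = 0.0004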
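The sample size is fixed at 500 in every run; only the number of repetitions changes. The formula-based standard error of a proportion for samples of 500 is \(\sqrt{\frac{0.8 \times 0.2}{500}}\) ≈ 0.0179, and the standard deviation of the sample proportions moves toward it as the repetitions increase (0.0140, 0.0170, 0.0176).
Dividing that standard deviation by the square root of the number of repetitions gives the uncertainty of the mean of the sample proportions, and this is the quantity that shrinks: SE20times > SE200times > SE2000times.
This demonstrates that the simulation-based standard deviation of the estimate converges to the formula-based standard error, and that with more repetitions the mean of the sample proportions settles closer to the true population proportion of 0.8.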
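The binary simulation can be sketched in the same way. The code below is a minimal illustration and not the code behind the numbers above. It draws samples of 500 without replacement from the 0/1 population described in the text. The seed and the name `sd_of_sample_proportions` are assumptions. It prints the simulated standard deviation of the sample proportions next to \(\sqrt{p(1-p)/n}\) with n = 500.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(2)  # arbitrary seed, only for reproducibility

# Population coded as in the text: 8000 adults (1) and 2000 minors (0).
population = [1] * 8000 + [0] * 2000
sample_size = 500

p = fmean(population)
formula_se = sqrt(p * (1 - p) / sample_size)


def sd_of_sample_proportions(pop, size, repetitions):
    """Draw `repetitions` samples of `size`; return the SD of their proportions."""
    props = [fmean(random.sample(pop, size)) for _ in range(repetitions)]
    return stdev(props)


print(f"formula SE for n = {sample_size}: {formula_se:.4f}")
for repetitions in (20, 200, 2000):
    sim_sd = sd_of_sample_proportions(population, sample_size, repetitions)
    print(f"{repetitions:>5} repetitions: SD of sample proportions = {sim_sd:.4f}")
```

With only 20 repetitions the simulated SD can land noticeably above or below the formula value. With 2000 it should sit close to it, slightly lower because sampling is without replacement from a finite population.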